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ABSTRACT 

We describe the use of Graphics Processing Units (GPUs) for speeding up the code nbodyq 
which is widely used for direct TV-body simulations. Over the years, the iV 2 nature of the 
direct force calculation has proved a barrier for extending the particle number. Following an 
early introduction of force polynomials and individual time-steps, the calculation cost was 
first reduced by the introduction of a neighbour scheme. After a decade of GRAPE computers 
which speeded up the force calculation further, we are now in the era of GPUs where relatively 
small hardware systems are highly cost-effective. A significant gain in efficiency is achieved 
by employing the GPU to obtain the so-called regular force which typically involves some 99 
percent of the particles, while the remaining local forces are evaluated on the host. However, 
the latter operation is performed up to 20 times more frequently and may still account for a 
significant cost. This effort is reduced by parallel SSE/AVX procedures where each interaction 
term is calculated using mainly single precision. We also discuss further strategies connected 
with coordinate and velocity prediction required by the integration scheme. This leaves hard 
binaries and multiple close encounters which are treated by several regularization methods. 
The present nbody6-gpu code is well balanced for simulations in the particle range 10 4 — 
2 x 10 5 for a dual GPU system attached to a standard PC. 
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1 INTRODUCTION 

The quest to perform efficient iV-body calculations has challenged 
astronomers and computer scientists ever since the early 1960s. For 
a long time progress was slow but so was the increase in computing 
power. The first significant advance was achieved by the Ahmad- 
Cohen (1973, AC) neighbour scheme which splits the total force 
into a distant slowly changing part and a local contribution with 
shorter time-scale, hereafter denoted as the regular and irregular 
force. Considerable progress on the hardware side was made when 
the GRAPE-type special-purpose computers were developed in the 
early 1990s (Makino, Kokubo & Taiji 1994) and later improved to 
GRAPE-4 and GRAPE-6 (Makino et al. 2003). More recently, the 
general availability of Graphics Processing Units (GPUs) and the 
corresponding CUDA programming language have facilitated large 
gains in A^-body simulations at modest cost. Early applications 
based on CUDA (Nyland, Harris & Prins 2007, Belleman, Bedorf 
& Portegies Zwart 2008) demonstrated significant speeding-up, ap- 
proaching GRAPE-6 performance. Moreover, the introduction of 
pseudo double precision for the coordinate differences without sac- 
rificing much efficiency (Nitadori 2009) ensured increased confi- 
dence in the results. In other problems, the employment of CUDA 
has already led to Petaflop performance by combining several thou- 
sand GPUs. 



In this paper, we are mainly concerned with small stand-alone 
systems using one or two GPUs with the code nbody6-Gpu. 
However, mention should also be made of the parallel version 
nbody6 + + which is capable of reaching somewhat larger parti- 
cle numbers using several types of hardware (Spurzem 1999) and 
is also intended for GPUs. 

This paper is organized as follows. After reviewing some rel- 
evant aspects in the standard nbodyq code we discuss the new 
treatment of the regular and irregular force. As a result of improve- 
ments in the regular force calculation, the irregular force now be- 
comes relatively expensive. Although the latter may also be evalu- 
ated on the GPU, the overheads are too large. Instead we perform 
this calculation in parallel using SSE0 (Streaming SIMD Exten- 
sions) and OpenMP in C++ with GCC built-in functions. The pres- 
ence of hard binaries requires careful attention, especially because 
the regularization scheme involves different precision for the cen- 
tre of mass (cm.) motion as evaluated by FORTRAN and SSE. A 
later implementation with AVX accelerated the irregular force cal- 
culation. The performance gain is illustrated by comparison with 
the basic version at relatively small particle numbers and we esti- 
mate the cost of doing large simulations for two types of hardware. 
Finally, we summarize current experience with small GPU systems 
and point to possible future developments. 
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2 BASIC NBODY6 CODE 

The code nbody6 was developed during the late 1990s (Aarseth 
1999). It was based on a previous code nbodys which also em- 
ployed the AC neighbour scheme. Here the main improvement was 
to replace the fourth-order Adams method by an equivalent Hermite 
formulation (Makino 1991) for the case of two force polynomials 
(Makino & Aarseth 1992). An intermediate step was made with 
the Hermite individual time-step code nbodya (Aarseth 1996) de- 
signed for use by several generations of GRAPE-type computers. 

The combination of Hermite integration with block-steps has 
proved a powerful tool in A-body simulations. At earlier times 
its simple form was beneficial for using together with the special- 
purpose GRAPE hardware which supplies the force and its first 
derivative. This allows for the construction of an efficient and accu- 
rate fourth-order integration method where the introduction of hi- 
erarchical block-steps reduces the overheads of coordinate and ve- 
locity predictions considerably and facilitates parallel procedures. 
Experience has also shown that the Hermite AC block-step scheme 
is highly cost-effective in the standard nbody6 code. 

On conventional computers, the regular force calculation 
dominates the CPU time, with only a weak dependence on the 
neighbour strategy. Thus there are compensating factors when the 
number of neighbours (n^) is varied. The new neighbours are cho- 
sen at the time of a regular force calculation. Hence all particles in- 
side the corresponding neighbour radius are selected, together with 
any particles in an outer shell approaching with small impact pa- 
rameter. Individual neighbour radii i? s are adjusted according to the 
local density contrast, with additional modifications near the upper 
and lower boundary. New values are obtained by the expression 

jc , =ic id ^y 3 , (i) 

where n p is predicted from the local density contrast. Note that 
the case of zero neighbour number which may occur for distant 
particles is also catered for. Formally, explicit derivative corrections 
to the force polynomial should be carried out for each neighbour 
change. However, the same terms are added and subtracted from 
the respective polynomials so that this overhead may be omitted, 
provided the desired results are obtained at times commensurate 
with the maximum time-step (Makino & Aarseth 1992). 

The treatment of close encounters forms a large part of the 
code. We distinguish between two-body and multiple encounters 
which are studied by the tools of Kustaanheimo-Stiefel (1965, KS) 
and chain regularization (Mikkola & Aarseth 1993). Several meth- 
ods for integrating the KS equations of motion have been used over 
the years. The preferred method is a high-order Hermite scheme 
(Mikkola & Aarseth 1998) and an iterative solution without recal- 
culating the external perturbation. This allows the physical time to 
be obtained by a sixth-order Taylor series expansion of the time 
transformation %' — R. The so-called Stumpff method maintains 
machine accuracy in the limit of small perturbations and only re- 
quires half the number of steps per orbit compared to the KS fourth- 
order polynomial method employed by nbodyh. Even so, small 
systematic errors are present over long time intervals. Hence a rec- 
tification of the KS variables u, u' is performed to be consistent 
with the integrated value of the binding energy (Fukushima 2005). 
Further speed-up can be achieved for small perturbations by em- 
ploying the slow-down concept in which the time and perturbing 
force are magnified according to the principle of adiabatic invari- 
ance (Mikkola & Aarseth 1996). 

Given a population of hard binaries, close encounters between 



binaries and field stars or other binaries are an interesting feature 
particularly because collisions or violent ejections may occur. The 
chain concept led to a powerful method for studying strong inter- 
actions of 3-5 particles where two-body singularities are removed 
(Mikkola & Aarseth 1993). Implementation of perturbed chain reg- 
ularization introduces many complexities, as well as dealing with 
internal tidal effects, membership changes or post-Newtonian terms 
(Aarseth 2003). As far as the A-body code is concerned, the asso- 
ciated cm. is integrated like a single particle with the force and first 
derivative obtained by mass-weighted summation over the compo- 
nents. The internal motions are advanced by a high-order integra- 
tor (Bulirsch & Stoer 1966) with energy conservation better than 
10~ 10 . Moreover, the solutions for internal KS or chain are contin- 
ued until the end of the new block-step before the other particles are 
treated. As for KS termination, this is implemented exactly at the 
end by a simple iteration. Internal chain integration, on the other 
hand, is only advanced while the new time is less than the block- 
time and any coordinates required by other particles are obtained 
by prediction. Consequently, chain termination is performed at an 
arbitrary time and a new current commensurate time is constructed 
by suitable subdivision. This in turn reduces the block-step further 
but is compensated by the relatively small number of chain treat- 
ments. 

Finally, in this paper, we omit a discussion of synthetic stel- 
lar evolution which is optional and forms a large part of the code. 
Hence for the purpose of GPU developments these aspects may 
be ignored for simplicity. In any case, the additional CPU time re- 
quired by the host is relatively small. 



3 NEW IMPLEMENTATIONS 

We now turn to describing some relevant procedures associated 
with the new implementations. 



3.1 Software design 

A subroutine INTGRT in nbodys drives the numerical integra- 
tion of the AC neighbour scheme. It uses two libraries during the 
integration, prepared for the calculation of the regular and the ir- 
regular force, named GPUNB and GPUIRR. Although the names 
of the libraries include GPU, non-GPU implementations exist, e.g. 
plain C++ versions or high performance versions with SSE/AVX 
and OpenMP. The present nbody6~gpu code employs a GPU 
version written in CUDA (with multiple-GPU support) for GPUNB 
and a CPU version with an acceleration by SSE/AVX and OpenMP 
for GPUIRR. A first implementation of GPUIRR used GPU. How- 
ever, it was slower than a fine-tuned CPU code because of the fine- 
grained nature of the irregular force calculation. 

Fig. 1 illustrates a schematic diagram for the relation between 
intgrt . f , GPUNB and GPUIRR. Here, GPUNB receives posi- 
tions, velocities, masses and time-steps of the attracting particles 
(so called j-particles) and positions, velocities and neighbour radii 
of the attracted (or active) particles (i-particles), and it returns regu- 
lar forces, their time derivatives and neighbour lists. GPUIRR holds 
the position and up to its third time derivatives, mass, the time 
of last irregular integration and the current neighbour list of each 
particle, and returns the irregular force and its time derivative for 
active particles. After an irregular step of particle i, the variables 
Hi, Rj, Fj, Fj, ti,i and mi are sent to GPUIRR and after a regular 
step, the list Li is sent if there is a change. 
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intgrt . f 



Rj , Rj , rrij , At R j ; R; , R^ , 



GPUNB 



Fi.i , F I; i 



GPUIRR 



Figure 1. A schematic diagram of the NBODY6-GPU code. 



3.2 Velocity neighbour criterion 

The traditional neighbour criterion that a particle j is a neighbour 
of particle i is defined by 



|Ry| < hi 



(2) 



where Ry = Rj — Ri and hi is the radius of the neighbour sphere. 
In this implementation, we employ a modified criterion to include 
the velocities by the condition 



Rij,min — mill ( |R»j|, |R»j 



< hi 



(3) 



where At^i is the regular time-step. Although this increases the 
computational cost for each pairwise interaction evaluation in the 
regular force calculation, we can take larger regular time- steps for 
the same number of neighbours. This safety condition also ensures 
that high- velocity particles can be added to the neighbour list before 
they come too close. 



3.3 Block-step procedure 

Let us examine the sequential procedure for one block-step, 
(i) Obtain the next time for integration, 



tncxt = min (ti,i + Ati.i), 



(4) 



where ti,, and Aii,, are the time of the last irregular force calcula- 
tion and irregular time-step of particle i. 

(ii) Make the active particle list for regular and irregular force 
calculation, 



Lact.R — {i | tti,i + AtR.i — tnext} i 
Lact.I = {i | tl,< + Atl,! = inext} i 



(5) 
(6) 



where the subscripts R and I denote regular and irregular terms. 
Here, {i | cond.} defines a set of i such that it satisfies cond. 

(iii) Predict all particles needed for force evaluation. 

(iv) Calculate the irregular force and its time derivative for par- 
ticle i € L act ,I, 



Pi, 



R, 



R; 



|Ry| 3 



|R«I 6 



(7) 



(8) 



(v) Apply the corrector for the active irregular particles and de- 
cide the next time-step Ati.i. 



(vi) Accumulate the regular force and its time derivative for each 
active regular particle i 6 L act ,R and construct the neighbour list, 

JV ( Ri 



Vj'" H - 

j^i I (otherwise) 



> hi) 



f m = e n 



Li = {j | j # i, Rij,uiiu < hi} . 



(-Rij,min>hi) 

(otherwise) 



(9) 



(10) 



(11) 



(vii) Execute the regular corrector. Since the neighbour list Li 
has been updated, the force polynomials should be corrected to re- 
flect the difference between the old and new list. 



3.4 The GPUNB library 

The GPUNB library computes regular forces and creates neighbour 
lists for a given set of particles with single or multiple GPU(s). 
First, the predicted position, velocity and mass of Nj particles are 
sent from the host. The regular force and neighbour lists of Ni par- 
ticles are then computed. Therefore, NiNj pairwise interactions 
are evaluated in one call. Unless Ni is much smaller than Nj, the 
cost of the prediction and data transfer is not significant. 

The basic method to calculate the force on GPUs is common to 
the previously existing implementations (Nitadori 2009, Gaburov, 
Harfst & Portegies Zwart 2009), except for the treatment of neigh- 
bours. During the accumulation of forces from all Nj particles, 
when j is a neighbour, GPUNB skips the accumulation and saves 
the index j in the neighbour list. This procedure is applied for all 
Ni particles in parallel, using the many threads of the GPU. 

Now we discuss in more detail the force calculation proce- 
dure of GPUNB. Each i-particle is assigned to each thread of GPU. 
The position, velocity, neighbour radius, force, its time derivative 
and neighbour count of particle i are held on the registers of each 
thread. The position, velocity and mass of particle j are broadcast 
from the memory to all the threads in a thread-block, and forces on 
multiple i-particles are evaluated in parallel. If particle j turns out 
to be a neighbour of i, the index j is written to the neighbour list in 
the memory. 

The actual behaviour of each thread is given in Listing B2 with 
the CUDA C++ language. There is an 'if statement' for the neigh- 
bour treatment, which is translated into a mask operation by the 
compiler and has little impact on performance. If the branch is not 
removed, it would be a serious overhead for parallel performance. 

As well as the i-parallelism for multiple threads, j-parallelism 
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for multiple thread-blocks and multiple GPUs are also exploited. 
Thus, after the force computations, partial forces and partial neigh- 
bour lists of certain particles are distributed for multiple thread- 
blocks. To minimize the data transfer from GPU to host, we pre- 
pared kernels for force reduction and list gathering, where local 
indices in 16-bit integer are translated to global indices in 32-bit in- 
teger. Partial forces are summed and sparsely scattered partial lists 
are serialized to a linear array before they are sent back to the host 
PC. 

Finally, single precision arithmetic turned out to provide suf- 
ficiently accurate results for practical use. This is because all the 
close interactions are skipped in the regular force procedure^. 

3.5 The GPUIRRlibrary 

The name of the library GPUIRR comes from an early effort to ac- 
celerate the irregular force calculation on GPU, although the GPU 
is not used in the current implementation, just because it is slower 
than a well tuned CPU code. Still, the name of the library and 
its API are used in the fast version on CPU with SSE/AVX and 
OpenMP. 

Different from GPUNB, GPUIRR retains many internal states. 
The position and up to its third derivative, mass and time of the last 
integration of particle i are held to construct the predictor. These 
quantities are updated after each irregular step. Additionally, the 
neighbour list of particle i is saved for the computation of the irreg- 
ular force. The list is copied from the host routine intgrt . f after 
each regular step. When an irregular force on particle i is requested, 
the library calculates Fi,j and Fi,i. Actually, it can be performed 
in parallel, i.e. the library receives a list of irregular active particles 
Lact.i and returns an array of the force and its time derivative. 

The library is tuned for multi-core CPUs and SIMD instruc- 
tions such as SSE and AVX. An OpenMP parallelization for dif- 
ferent i-particles is straightforward. On the other hand, the SIMD 
instructions are exploited for the j-parallelism for one i-particle, 
which requires technical coding. To calculate pairwise forces on 
a particle i from multiple (4-way for SSE, 8-way for AVX) j- 
particles, we need to gather them from non-contiguous addresses 
indicated by the neighbour list. The gathering process is relatively 
expensive and the GFLOPS rating or 'number of interactions per 
second' is decreased to about 40 percent of the case of brute force 
calculations (Tanikawa et al. 2012) on the same processor with the 
same AVX instruction set. 

Unlike the regular force calculations, full single precision cal- 
culation of the irregular force may not be accurate enough during 
close encounters. Thus, we employ the so-called 'two-float' tech- 
nique to express the coordinates of each particle (Nitadori 2009, 
Gaburov et al. 2009). For the summation of force, single precision 
turned out to be sufficient since we do not accumulate too many 
terms (more than a few hundred) in the irregular force calculation. 

Prediction of positions and velocities are performed inside the 
library. When A r act ,i is small, predicting all the N particles is not 
efficient. Thus, at some point (cf. NPACT in Table CI), we switch 
to predict only the necessary particles in this block-step. Even if we 
only predict the necessary particles, combining and sorting of the 
neighbour lists or random memory access may be expensive. Thus 
there is a turn-around point, and an example is shown in the lines 
24-29 in Listing CI. 

2 We still left options using the 'two-float' method for the coordinates and 
the accumulator. 



3.6 Binary effects 

The presence of binaries poses additional complications when com- 
bining results of force calculations from the host and the library for 
irregular force. As can be seen above, the irregular force on a sin- 
gle particle due to other single particles is of simple form and its 
calculation can be speeded up in C++ with SSE/AVX and OpenMP. 
This is no longer the case for regularized systems where differential 
force corrections are needed on the host to compensate for the cm. 
approximation which is employed. For simplicity we assume that 
only the irregular force needs to be corrected; this in turn implies 
that the neighbour radius is sufficiently large. A further complica- 
tion arises in the force evaluation due to perturbers. The numerical 
problem can be illustrated by considering the new regular force dif- 
ference during the corresponding time interval t — t Ri ;, 



AFr = (F R low - F° ld ) + (Ff w - F° M ) , (12) 



where Fr and Fr cw denote regular forces evaluated at the old 
and new times t^i and t, respectively, while F° ld and FJ ew de- 
note irregular forces both evaluated at the new time t with the old 
and new neighbour lists. Hence the net change of irregular force is 
contained within the second bracket. In the case of no neighbour 
change, this term should be suitably small, otherwise it would tend 
to reduce the regular time-step. It is therefore important to ensure 
consistency between the irregular force as calculated at each time- 
step and that obtained elsewhere at the end of a regular step where 
both are based on the same predicted quantities. 

In the case of an active KS binary, the old and new force con- 
tributions due to perturbers are acquired in double precision on 
the host and similarly for the cm. term which is subtracted. Since 
the latter is evaluated in the single precision GPUIRR library, this 
means that a small discontinuity is introduced. However, the old 
and new irregular force are still numerically identical in the ab- 
sence of neighbour change and hence the same neighbours do not 
affect the regular force difference. A similar differential force pro- 
cedure is carried out for single particles having perturbed binaries 
as neighbours. Note that the sequential ordering of neighbour lists 
facilitates the identification of KS binaries for various purposes. 
Finally, force corrections involving a chain cm. particle are per- 
formed analogously, where the internal contributions to F and F 
are obtained by summation over the members. 

The strategy for controlling the number of neighbours also 
changes when using the GPU. Unless present initially, binaries 
eventually become an important feature of iV-body simulations. It 
is therefore desirable to strike a balance between the maximum size 
of regularized binaries and the neighbour radius. This can read- 
ily be done for long-lived binaries, subject to the maximum mem- 
bership denoted by the parameter NBMAX. The average neighbour 
number can also be controlled to a certain extent. This is achieved 
by an optional adjustment of a density contrast parameter a used in 
the predicted neighbour number n p of equation {T}. Fortunately the 
ratio Rs/a becomes more favourable for large N because the semi- 
major axis of hard binaries is oc 1/N while the neighbour radius 
reduces more slowly. Hence a safe strategy for small and interme- 
diate simulations is to employ a relatively large value of NBMAX 
together with fine-tuning of a. Also note that the possible problem 
of too small neighbour radii is mostly relevant for the central cluster 
region because i? s tends to be larger at lower density. 
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Table 1. Hardware and software configurations. 





System A 


System B 


CPU 


Core i7-920 


Core i7-2600K 


(spec) 


4 cores, 2.66 GHz 


4 cores, 3.40 GHz 


(SIMD) 


SSE4.2 (4-float) 


AVX (8-float) 


GPU 


2 X GeForce GTX 470 


2 x GeForce GTX 560 Ti 


Motherboard 


MSI X58 


MSI P67 


Memory 


6 GB, DDR3-1333 


8 GB, DDR3-1600 


OS 


CentOS 5.5 x86.64 


CentOS 6.0 x86.64 


Compilers 


GCC 4.1.2, CUDA3.0 


GCC 4.6.1, CUDA 4.0 



Table 2. Performance summary for System B. We show total wall-clock 
time and partial time for regular and irregular force. The number of regu- 
lar and irregular individual steps is also given, together with the time-step 
weighted average neighbour numbers. 



N 


32k 


64k 


128k 


256k 


T„aii/sec 


18 


59 


209 


802 


T rcg /sec 


5.3 


19 


84 


338 


T irr /sec 


4.7 


13 


47 


230 


A/"r C g/10 6 


2.8 


6.1 


13 


28 


Mrr/10 6 


33 


85 


220 


550 


Mrr/A/'rcg 


11.8 


13.9 


16.9 


19.6 


<AU} 


46 


55 


64 


83 



4 PERFORMANCE 

The acid test of any code is measured by its performance. This 
is especially the case for TV-body simulations where an important 
challenge is to describe the dynamics of globular clusters. Here we 
report on some performance tests to illustrate the calculation cost 
for a wide range of particle numbers. It should be emphasized that 
timings based on idealized systems do not demonstrate the capa- 
bility of dealing with more advanced stages which are character- 
ized by large density contrasts and the presence of hard binaries. 
Additional examinations of more realistic conditions are therefore 
highly desirable, as commented on below. 

For the basic performance tests we study isolated Plummer 
models in virial equilibrium with equal-mass particles in the range 
N = 8-256 k. We use standard iV-body units with total energy 
E — —0.25 and total mass 1. The benchmarks use two hard- 
ware systems, A and B, where the specifications are summarized 
in Table 1. Timing tests were first made for the older System A. 

In order to have a more balanced dynamical state, the com- 
puting times are measured from t = 2 to t = 4. The wall-clock 
time (in sec) as a function of N is shown in Fig. 2 for both systems. 
Also shown separately are the times for the regular and irregular 
force calculation. As can be seen, these timings are quite similar 
for System A, especially in the large- limit. The minimum cost is 
achieved for particular choices of the upper limit of the neighbour 
number, NBMAX, which are close to values used in long-term sim- 
ulations. As for the remaining time expended by the FORTRAN 
part, it forms a diminishing fraction of the total effort, with about 
40 percent for N = 32 k and 20 percent for N = 256 k. 

Timing comparisons for the standard nbody6 code have also 
been carried out. Thus at the upper limit of N — 24 000, the wall- 
clock time is 56 times the corresponding time for System A, with an 
asymptotic dependence r com p oc AT 18 in both cases. Excluding the 
time consumption of the other parts, the performance of the regular 
force calculation with System A exceeds 1 TFLOPS. As shown in 
Fig. 2, the CPU times for the irregular force with System B are 
almost a factor of 2 faster than for System A when the slightly 
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Figure 2. Wall-clock times for the original NBODY6 (0) and the 
NBODY6-GPU (★) code for the interval 2 < t < 4. Partial times for 
the regular and irregular force are given by squares and triangles, respec- 
tively. Filled symbols and solid lines are for system A and open symbols 
and dotted lines for system B. 
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Figure 3. Wall-clock times for the NBODY6-SSE code (♦, 0) and 
NBODY6-GPU code with one GPU (■, □) and two GPUs (A, A) for the 
interval 2 ^ t ^ 4. Solid lines with filled symbols (♦, ■ A) are for the 
total times and dotted lines with open symbols (0, □, A) for regular force 
times. 
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faster clock is included. This improvement is mainly due to the use 
of AVX instructions which became available very recently. More 
precise timings for System B are presented in Table 2, as well as 
time-step counts and average neighbour numbers for different TV. 
In comparison, the actual timings for TV = 256 k and System A 
were 911, 366 and 354 sec, respectively 0. We also note that the 
average neighbour number scales approximately as TV 1//3 . 

The difference between using one and two GPUs is of inter- 
est. In Fig. 3, we show timings on System A using one and two 
GPUs, as well as a fully tuned CPU version with SSE and OpenMP 
(nbody6-SSe). The wall-clock times are plotted in filled sym- 
bols with solid lines, and times for the regular force in open sym- 
bols with dotted lines. The performance gain in computing regular 
forces from the SSE version to one GPU is about a factor of 10, 
and this is doubled when going from one GPU to two GPUs. How- 
ever, the linear speed-up of the regular force calculations does not 
always correspond to a good scalability of the total simulation time 
because of the time consumed by the other parts. For example with 
TV = 64 k, the regular force time is reduced from 45 to 23 sec by 
using two GPUs, although this becomes 99 and 77 sec for the total 
time. Hence two separate simulations may be made simultaneously 
on a dual GPU machine to keep the computing units busy, provided 
TV is not too large. 

The question of reproducibility in the multi-thread environ- 
ment is a difficult one, particularly in view of the chaotic nature ex- 
hibited by the TV-body problem. On the other hand, the calculations 
are speeded up significantly by using parallel OpenMP procedures. 
At present, two important parallel treatments (lines 46-50 and 104— 
115 of Listing CI) are not thread-safe while some other FORTRAN 
parts are well behaved. Consequently, strict reproducibility can be 
enforced by omitting these procedures at some loss of efficiency. 
However, the timing tests were carried out with full optimization 
since the early stage is essentially reproducible in any case. 

The performance tests employed standard time-step param- 
eters (rj — 0.02) for the irregular and regular force polynomi- 
als. Typical relative energy errors for the time interval quoted are 
AE/E ~ 1.5 x 10" 7 (TV = 128 and 256 k; also TV = 16 k). This 
compares favourably with values 4 x 10~ 7 for the original code 
(TV = 8 and 16 k). It should be noted that the intrinsic relative error 
in potential energy evaluated on the GPU for efficiency is 1 x 10~ 8 
and hence on the safe side. 

Most TV-body simulations, whether they be core collapse or 
substantial evaporation, are concerned with long time-scales. It is 
therefore desirable to assess the code behaviour for more advanced 
stages of evolution with high core density which inevitably leads 
to binary formation and strong interactions in compact subsystems. 
At this stage the special treatments of close encounters in the form 
of KS and chain regularization begin to play an important role. The 
ejection of high-velocity members is a hallmark of an evolved dy- 
namical state. In general, the main energy errors are due to strong 
interactions, especially in connection with switching to or from 
regularization procedures. Even so, a study of core collapse for 
an equal-mass system with TV = 32 k showed that the accumu- 
lated changes in total energy are surprisingly small, amounting to 
J2 AE i = -lx 10 -4 at minimum core radius. A comparable drift 
in total energy was also seen in a test calculation well beyond core 
collapse for TV = 16 k. Hence the Hermite integration scheme has 
proved to possess excellent long-term stability. 



3 Because of cache miss, there is a degradation above TV = 64 k for the 
irregular force calculation with AVX. 



5 CONCLUSIONS 

We have presented new implementations for efficient integration 
of the TV-body problem with GPUs. In the standard nbodys 
code, the regular force calculation dominates the CPU time. Con- 
sequently, the emphasis here has been on procedures for speeding 
up the force calculation. First the regular force evaluation was im- 
plemented on the GPU using the library GPUNB which also forms 
the neighbour list. This procedure is ideally suited to massively par- 
allel force calculations on GPUs and resulted in significant gains. 
However, a subsequent attempt to employ the GPU for the irreg- 
ular force showed that the overheads are too large. It turned out 
that different strategies are needed for dealing with the regular and 
irregular force components and this eventually led to the develop- 
ment of the special library GPUIRR. The use of SSE and OpenMP 
speeded up this part such that the respective wall-clock times are 
comparable for a range of particle numbers. 

After the recent hardware with AVX support became avail- 
able, the library was updated. This led to additional speed-up of 
the irregular force calculation. It is also essential that the regular 
force part scales well on multiple GPUs. Thus in the future, fur- 
ther speed-up may be achieved by using four GPUs and an octo- 
core CPU. It should be noted that in the present scheme, the use 
of multiple GPUs only benefits the regular force calculation which 
therefore scales well. Although large TV-body simulations are still 
quite expensive, we have demonstrated that the regular part of the 
Ahmad-Cohen neighbour scheme is well suited for use with GPU 
hardware. Moreover, the current formulation of the nbody6-Gpu 
code performs well for a variety of difficult conditions. 
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float3 vel; 
float pad; // 

}; 

struct Iparticle{ 
float3 pos; 
float h2; 
float3 vel; 
float dtr; // 

}; 

struct Force { 
float3 acc; 
float pot; 
float3 jrk; 
int nnb; / / 

}; 



APPENDIX A: GLOSSARY 

GCC GNU Compiler Collection, including gcc, g++, 
gf ortran and other languages. 

API Application Programing Interface. This provides definitions 
or prototypes of FORTRAN subroutines or C functions. 

SIMD Single Instruction Multiple Data. A model of parallel 
computer where an operation such as addition and multiplication 
is performed for multiple data. 

SSE Streaming SIMD Extensions. Additional instruction set for 
x86. Four words single precision floating point operations on 128- 
bit registers are supported. 

AVX Advanced Vector extensions. Further enlargement of SSE. 
Eight words single precision or four words double precision float- 
ing point operations on 256-bit registers. 

CUDA Compute Unified Device Architecture. A frame-work for 
general purpose computing on NVIDA GPUs, including language, 
compiler, run-time library and device driver. 

OpenMP A standard to utilize multiple processors on shared 
memory from high-level languages, provided as directives of the 
language. 

Thread A unit of parallel execution. Threads of CPU execute dif- 
ferent contexts, while a cluster of threads of GPU executes the same 
context for different data. 

Thread-Block A number of GPU threads which share the same 
context. 

Kernel A short program submitted and executed on GPU. 
i-particle A particle which feels the gravitational force, named 
from the outer loop index. 

y-particle A particle which is a source of the gravitational force, 
named from the inner loop index. 

/-parallelism Parallelism for the outer loop. Forces on different 
particles are calculated in parallel. 

y'-parallelism Parallelism for the inner loop. Forces from differ- 
ent particles are calculated in parallel, and summed later. 



APPENDIX B: CUDA CODES 

In Listing B 1 and B2, we show the innermost regular force kernel 
written in CUDA C++ as well as some definitions of data struc- 
tures. Each CUDA thread holds one i-particle in its registers, and 
evaluates a force from particle j and accumulates it. The functions 
qualified with __device_ are forced to be in-line. The argument 
nblist is a pointer to the device memory. 

Listing Bl: Definitions of structures. 





Listing B2: The innermost g 


ravity calulation. 


1 


// num of neib per block, must be power of 2 


2 


#define NNB_PER_BLOCK 256 




3 
4 


device void dev gravity ( 




5 


const int jidx, 




6 


const Iparticle sip, 




7 


const Jparticle &jp, 




8 


Force sfo, 




9 


uintl6 nblist [] ) 




10 


{ 




1 1 


float dx = jp.pos.x - ip.pos.x 




12 


float dy = jp.pos.y - ip.pos.y 




13 


float dz = jp.pos.z - ip.pos.z 




14 


float dvx = jp.vel.x - ip.vel.x 




15 


float dvy = jp.vel.y - ip.vel.y 




16 


float dvz = jp.vel.z - ip.vel.z 




17 


float dxp = dx + ip.dtr * dvx; 




18 


float dyp = dy + ip.dtr * dvy; 




19 


float dzp = dz + ip.dtr * dvz; 




20 






21 


float r2 = dx *dx + dy *dy + 


dz *dz; 


22 


float r2p = dxp*dxp + dyp*dyp + 


dzp*dzp; 


23 


float rv = dx *dvx + dy *dvy + 


dz *dvz; 


24 






25 


float rinvl = rsqrtf(r2); 




26 


if(fminf(r2, r2p) < ip.h2){ 




27 


// address to avoid buffer overflow 


28 


int addr = f o . nnb & (NNB_PER_BLOCK -1); 


29 


nblist [addr] = {uintl6} jidx; 


30 


to . nnb++; 




31 


rinvl = O.f; 




32 


1 




33 


float rinv2 = rinvl * rinvl; 




34 


float mrinvl = jp.mass * rinvl; 




35 


float mrinv3 = mrinvl * rinv2; 




36 


float alpha = -3.f * rv * rinv2; 


37 






38 


#ifdef POTENTIAL 




39 


fo.pot += mrinvl; 




40 


#endif 




41 


f o . acc . x += mrinv3 * dx; 




42 


fo.acc.y += mrinv3 * dy; 




43 


f o . acc . z += mrinv3 * dz; 




44 


fo.jrk.x += mrinv3 * (dvx + alpha * dx) ; 


45 


fa. jrk. y += mrinv3 * (dvy + alpha * dy) ; 


46 


fo.jrk.z += mrinv3 * (dvz + alpha * dz); 


47 


} 




In lines 17-19, 22 and 26, we can see the additional operations for 



struct Jparticle{ 
float3 pos; 
float mass; 



the velocity criterion (5 mul, 6 add, 1 min) discussed in Section 3.2. 



APPENDIX C: FORTRAN CODES 

It is instructive to examine the program flow for the treatment of the 
regular and irregular force during one block-step. We have copied 
the relevant FORTRAN parts, omitting some extra features, and 
display a complete cycle in the general case. Consider the situa- 
tion when the next block-step time, denoted by TMIN, has been 
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Table CI. List of symbols. 



Symbol 


Data type 


Definition 


LMAX 


PARAMETER 


Size of compiled neighbour arrays 


NBMAX 


PARAMETER 


Maximum neighbour number 


NIMAX 


PARAMETER 


Block-size of regular force loop 


NMAX 


PARAMETER 


Maximum particle number 


NPACT 


PARAMETER 


Limit for active particle prediction 


NPMAX 


PARAMETER 


Parallel irregular integration 


IFIRST 


INTEGER. 


Array index of first single particle 


NFR 


INTEGER 


Number of regular force calculations 


NQ 


INTEGER 


Membership of listq 


NTOT 


INTEGER 


Number of single particles and KS pairs 




TMT 1 MMA Y ^ 
_L In i. \ IN l v lrtj\ } 


I i nnvtir'lp*; rlnp hpfnrp n trivpn timp 


NXTLEN 


INT (NMAX) 


Length of current block-step list 


NXTLST 


INT (NMAX) 


List of block-step members 


RS 


REAL (NMAX) 


Radius of individual neighbour sphere 


STEPR 


REAL (NMAX) 


Regular time-step 


10 


REAL (NMAX) 


Time of last irregular force calculation 


TMIN 


REAL (NMAX) 


New block-time 


TNEW 


REAL (NMAX) 


Next irregular force time 


X 


REAL (3, NMAX) 


Predicted coordinates 


X0 


REAL (3, NMAX) 


Corrected coordinates 


XDOT 


REAL (3, NMAX) 


Predicted velocities 


X0DOT 


REAL (3, NMAX) 


Corrected velocities 



determined at the end of the previous step (as shown below). All 
particles due for consideration up to a certain time are contained in 
LISTQ, and the new block-step members NXTLEN which satisfy 
equation Q are saved in the list array NXTLST after the first call 
to INEXT. Procedures for advancing KS and chain solutions which 
usually follow here are omitted for simplicity. The essential sym- 
bols are defined in Table CI, where some are fixed parameters. For 
completeness, we also include the OPEN and CLOSE statements, 
normally only used at the start and end. With this preamble and the 
definitions of Table CI the code section can now be inspected with 
assistance from the explanatory comments. The purpose of most 
special procedures have already been discussed in the text and the 
naming is intended to be descriptive. 

Listing CI: extract of intgrt . omp . f . 



I 

2 
3 
4 
5 
6 
7 
8 
9 
10 
1 1 
12 
13 
14 
15 
16 
17 
18 
19 
20 
21 
22 
23 
24 
25 
26 
27 
28 
29 
30 
31 
32 
33 
34 
35 
36 
37 
38 



Declare typical parameters for N=64k. 
PARAMETER(NPACT=150, NPMAX=16, NIMAX=1024) 

Open the regular and irregular libraries. 
CALL GPUNB_OPEN (NTOT ) 
CALL GPUIRR_OPEN (NTOT, LMAX) 

Find all particles in next block ( TNEW = 
CALL INEXT (NQ, LISTQ, TMIN, NXTLEN, NXTLST ) 
TIME = TMIN 



Form lists of candidates 

NFR = 

DO 28 L = 1, NXTLEN 

J = NXTLST (L) 

IF (TNEW(J) .GE.TOR(J) 
NFR = NFR + 1 
IREG (NFR) = J 
IRR(L) = J 

ELSE 

IRR(L) = 
END IF 
2 3 CONTINUE 



STEPR (J)) THEN 



Decide between predicting <- NPACT active ( NFR= ) or all par 
IF (NXTLEN. LE. NPACT. AND. NFR. EQ.O) THEN 

CALL GPUIRR_PRED_ACT (NXTLEN, NXTLST, TIME) 
ELSE 

CALL GPUIRR_PRED_ALL( IFIRST, NTOT, TIME) 
END IF 



CALL GPUIRR_FIRR_VEC (NXTLEN, NXTLST, GF, GFD) 



Choose between standard and parallel 
IF (NXTLEN. LE. NPMAX) THEN 



GPUIRR libr 



Correct the irregular 
DO 48 II = 1, NXTLEN 



39 
40 
41 
42 
43 
44 
45 
46 
47 
48 
49 
50 
51 
52 
53 
54 
55 
56 
57 
58 
59 
60 
61 
62 
63 
64 
65 
66 
67 
68 
69 
70 
71 
72 
73 
74 
75 
76 
77 
78 
79 
80 
81 
82 
83 
84 
85 
86 
87 
88 
89 
90 
91 
92 
93 
94 
95 
96 
97 
98 
99 
100 
101 
102 
103 
104 
105 
106 
107 
108 
109 
110 
I 1 1 
112 
113 
1 14 
115 
116 
117 
118 
119 
120 
121 
122 
123 
124 
125 
126 
127 
128 
129 
130 
131 
132 
133 
134 
135 
136 
137 
138 
139 
140 
141 
142 
143 



I = NXTLST (II) 

CALL NBINT (I,IRR(II),GF(I,II), GFD (1, II) ) 
CONTINUE 



ELSE 



allel. 



* Perform irregular correction t 

' Somp parallel do private! II , I) 
DO 50 II = 1, NXTLEN 
I = NXTLST (II) 

CALL NBINTP (I, IRS (II) , GF (1,11), GFD ( 1 , 1 1 ) ) 
5B CONTINUE 
'Samp end parallel do 
END IF 



Check regular fore 
IF (NFR. GT . ) THEN 



updat, 



(NFR i 



* Predict all particles (except TPRED-TIME) in C++ on host. 
CALL CXVPRED ( IFIRST, NTOT, TIME, TO , X0 , X0DOT, F , FOOT, X, XDOT, TPRED) 

* Send all single particles and cm bodies to the GPU. 

NN = NTOT - IFIRST + 1 

CALL GPUNB_SEND (NN, BODY (IFIRST) , X(l, IFIRST) , XDOT ( 1 , IFIRST) ) 

* Perform regular force loop (blocks of NIMAX=I024j. 

JNEXT = 

DO 55 II = 1, NFR, NIMAX 

NI = MIN (NFR-JNEXT, NIMAX) 

* Copy neighbour radius. STEPR and state vector for each block. 
' Somp parallel do privatefLL. I. K) 

DO 52 LL = i,NI 

I = IREG ( JNEXT+LL) 
H2I (LL) = RS (I) "2 
DTR(LL) = STEPR(I) 
DO 51 K = 1,3 

XI (K, LL) = X (K, I) 

VI (K, LL) = XDOT (K, I) 

51 CONTINUE 

52 CONTINUE 
'Somp end parallel do 

* Evaluate forces, derivatives and neighbour lists for new block. 

CALL GPUNB_REGF (NI,H2I,DTR,XI,VI, GPUACC, GPUJRK, GPUPHI, LMAX, 

5 NBMAX, LISTGP) 

* Copy neighbour lists from the GPU. 

' Somp parallel do private! LL, I. ITEMP. NNB, LI. L) 
DO 5 6 LL = 1,NI 

I = IREG (JNEXT + LL) 
NNB = LISTGP (1,LL) 
LI = 1 

DO 53 L = 2,NNB+1 

* Note GPU address starts from {hence add IFIRST to neighbour list! 

ITEMP = LISTGP (L,LL) + IFIRST 
IF ( ITEMP . NE . I ) THEN 

LI = LI + 1 

LISTGP (LI, LL) = ITEMP 
END IF 

53 CONTINUE 

LISTGP (1, LL) = LI - 1 

CALL GPUIRR_SET_LIST (I, LISTGP ( 1 , LL ) ) 

56 CONTINUE 

' $Omp end parallel do 

* Evaluate current irregular forces by vector procedure. 

CALL GPUIRR_FIRR_VEC (NI, IREG (II) , GF ( 1 , 1 ) , GFD (1, 1) ) 
'Somp parallel do privatejLL. I, LX) 
DO 57 LL = 1,NI 

I = IREG (JNEXT+LL) 

* Send new irregular force and perform Hermile corrector. 

CALL GPUCOR(I,XI (1, LL) , VI (1, LL) , GPUACC (1, LL) , GPUJRK ( 1 , LL) , 

6 GF (1, LL) , GFD (1, LL) , LISTGP (1, LL) , LX) 

* Update neighbour lists in GPUIRR library I only if changed: LX > ) . 

IF (LX.GT.0) THEN 

CALL GPUIRR_SET_LTST(I, LIST (1, I) ) 
END IF 

57 CONTINUE 
'Somp end parallel do 

JNEXT = JNEXT + NI 
55 CONTINUE 
END IF 



STEP . 



GPUCOR) 



* Determine next block lime 
TMIN = 1.0D+10 

DO 60 L = 1, NXTLEN 
I = NXTLST (L) 

IF (TNEW (I) .LT.TMIN) THEN 

TMIN = TNEW (I) 
END IF 
60 CONTINUE 

'Somp parallel da private! I. L. K) 
DO 70 L = 1, NXTLEN 
I = NXTLST (L) 
DO 65 K = 1, 3 

X (K, I) = X0 (K, I) 
XDOT(K,I) = X0DOT(K,I) 
65 CONTINUE 

* Send corrected active particles to GPUIRR library. 

CALL GPUIRR_SET_JP (I,X0(l,I),X0DOT(l,I),F(l,I),FDOT(l,I), 
£ BODY (I) , TO (I) ) 

70 CONTINUE 
'Somp end parallel do 
CALL GPUNB_CLOSE 
CALL GPUIRR_CLOSE 
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